
Conversation

joecummings (Member) commented on Sep 19, 2025

What does this PR do?

  1. The PR defaults the dtype of the Trainer and ReferenceModel to bf16 (a rough sketch of the effect is shown below this list).
  2. I also slipped in a change that lets training proceed for as long as needed, toggled by the steps param in the trainer. Don't worry about it :)
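
At a high level, the default behaves like the plain-PyTorch sketch below. The names here (build_model, the layer sizes) are hypothetical stand-ins for illustration, not the actual Trainer/ReferenceModel code:

```python
import torch
import torch.nn as nn

def build_model(dtype: torch.dtype = torch.bfloat16) -> nn.Module:
    """Hypothetical stand-in for the Trainer / ReferenceModel setup."""
    model = nn.TransformerEncoderLayer(d_model=1024, nhead=16)
    # Casting parameters to bf16 halves their memory footprint
    # (2 bytes/param vs. 4 bytes/param for fp32).
    return model.to(dtype=dtype)

trainer_model = build_model()    # bf16 by default
reference_model = build_model()  # bf16 by default
```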

How do we know this works?

The primary way we know this works is by examining the memory taken up when running the models: I confirmed it is roughly halved by looking at nvtop logs. Luckily, we also have another easy way to confirm it works, because when you compute RMS norm with an input in bfloat16 and a weight in fp32, PyTorch emits this warning:

[0] /home/jrcummings/.fbpkg_conda_envs/forge-a7401c7/lib/python3.10/site-packages/torch/nn/functional.py:2920: UserWarning: Mismatch dtype between input and weight: input dtype = c10::BFloat16, weight dtype = float, Cannot dispatch to fused implementation. (Triggered internally at /mnt/code/pytorch/aten/src/ATen/native/layer_norm.cpp:344.)
[0]   return torch.rms_norm(input, normalized_shape, weight, eps)

Once this change is merged, the warning goes away.
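
For the curious, here is a standalone repro of that warning (my own snippet, not Forge code; it assumes a PyTorch version that exposes torch.nn.functional.rms_norm):

```python
import torch
import torch.nn.functional as F

hidden = 4096
x = torch.randn(2, hidden, dtype=torch.bfloat16)  # activations in bf16
w = torch.ones(hidden, dtype=torch.float32)       # norm weight still in fp32

# Mismatched dtypes: PyTorch warns it cannot dispatch to the fused
# implementation and falls back to a slower path.
y = F.rms_norm(x, (hidden,), weight=w, eps=1e-6)

# With the weight in bf16 as well (what defaulting the model to bf16 gives us),
# the warning disappears.
y = F.rms_norm(x, (hidden,), weight=w.to(torch.bfloat16), eps=1e-6)
```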

FAQs

  1. Does this work for single device? What an astute question: NO. The distributed APIs handle the conversion to a lower dtype, so if you don't use them, everything stays in fp32. This is annoying, for sure, but not blocking. Keep tracking this issue for more information.
  2. What about training stability? Fair play. While it is common practice to post-train in bf16, people have raised concerns that it performs worse than fp32. See here. Experiments before and after this change don't raise any red flags, but I would consider this part of the ongoing "correctness" work to ensure it doesn't cause problems (a toy illustration of the precision concern follows this list). cc @Ritesh1905
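
As a toy illustration of where that concern comes from (my own example, not from this PR's experiments): bf16 keeps fp32's exponent range but has only about 8 bits of mantissa, so a sufficiently small parameter update can be rounded away entirely:

```python
import torch

lr_times_grad = 1e-3  # a small parameter update

w_fp32 = torch.tensor(1.0, dtype=torch.float32)
w_bf16 = torch.tensor(1.0, dtype=torch.bfloat16)

print(w_fp32 - lr_times_grad)                                      # tensor(0.9990)
print(w_bf16 - torch.tensor(lr_times_grad, dtype=torch.bfloat16))  # tensor(1.) -- update rounded away
```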

The meta-cla bot added the CLA Signed label on Sep 19, 2025
joecummings marked this pull request as ready for review on September 19, 2025 at 21:00
joecummings merged commit 605f85f into meta-pytorch:main on Sep 19, 2025
5 checks passed
joecummings deleted the train-in-bf16 branch on September 19, 2025 at 21:05
allenwang28 (Contributor) commented:

> Does this work for single device? What an astute question: NO.

Isn't the 1.7B example still using single device though?
